According to the US Bureau of Labor Statistics, demand for jobs in data science is at an all time high. In the next decade, the number of data science jobs is projected to increase by nearly 12 million. Data science is more in demand than ever, and for good reason; data scientists work at the intersections of mathematics, technology, business, and countless other fields to provide insights and analysis. At the same time, the field of data science is continuously evolving, and the requirements of new data science jobs are changing constantly.
With this in mind, we set out to explore what optimal data science job applicants look like. First, we wanted to look at what factors are most important for determining an applicant’s future salary, whether these factors were related to type of company, an applicant’s skills and prior knowledge, or even the job description itself. A profession’s salary is on every future employee’s mind. Everyone that has, or will ever, hold a job wants to make as much money as they can. Therefore, modeling which variables are most important to a high-paying job can help future employees determine which jobs they should apply to.
For our second question, we looked more specifically at which skills a person should learn in order to maximize their average salary as well as the amount of jobs that are available to individuals with certain skills. Knowing which skills are most important for particular jobs is beneficial to anyone as it allows them to maximize the benefits they can receive for the effort that it requires to become proficient in a skill. The goal from this exploration will be to allow each person to get a glimpse into what type of specialization will allow them to maximize their salary potential as well as ability to get a job that they find rewarding.
The data that we used was published on Kaggle.com on December 29, 2021 by a user named Nikhil Bhathi. The dataset was scrapped from Glassdoor.com, a job-searching website that also allows employees to review companies. According to Bhathi, he scraped job postings related to the position of “Data Scientist” in the USA. The dataset has 742 observations, with each observation being one job posting. The following figure shows the geographical distribution of the jobs in the dataset:
The dataset originally contained 42 columns. This number was then reduced to 27 after dropping the columns that we were not interested in. Finally, we added 3 more variables, either derived from within the data or merged from outside data, making the final data consist of 30 columns. The following table shows 14 variables in our dataset:
| Job.Title | Size | Type.of.ownership | Sector | Revenue | Hourly | Employer.provided | AvgSalary | Age | seniority_by_title | Degree | COL | AdjustedSalary | NumSkills |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Scientist | 501 - 1000 | Company - Private | Aerospace & Defense | Medium | 0 | 0 | 72.0 | 48 | na | M | 92.9 | 775.0269 | 5 |
| Data Scientist | 10000+ | Other | Health Care | Large | 0 | 0 | 87.5 | 37 | na | M | 129.7 | 674.6338 | 1 |
| Data Scientist | 501 - 1000 | Company - Private | Business Services | Medium | 0 | 0 | 85.0 | 11 | na | M | 97.9 | 868.2329 | 5 |
The other 16 columns are for 16 different skills: Python, Spark, AWS, Excel, SQL, SAS, Keras, Pytorch, Scikit, TensorFlow, Hadoop, tableau, PowerBi, Flink, MongoDB, and Google Analytics. 1 means the corresponding skill is required, while 0 means the opposite.
To answer our first question, we wanted to fit a linear regression model to our data to predict the adjusted salaries of the jobs. The potential predictor variables include
Job.Title,Size,Type.of.ownership,Sector,Revenue,Hourly,Employer.provided,Age,seniority_by_title,Degree, andNumSkills. We decided to take a square root transformation of the response variable,AdjustedSalary, to address the rightward skewness and make the values more normal. After extensive searching and exploring, we picked 19 competing models. The models were as follows:
Model 0 to 10 were the best models for each number of predictors according to the bi-directional stepwise procedure. It used AIC as the criteria, and each of the first 10 models had the lowest AIC value compared to the other models that had the same number of predictors. Since the majority of our predictors were categorical variables, we could not ignore the interactions between them. Different combinations of the predictors could potentially have different influences on the salary of the jobs. We found three interactions that were logical to include in our models:
These interactions give further information about the position, the knowledge required, and the company of the job, respectively. Then, we used the FSA function from the rFSA package to further explore the interactions between the variables. The FSA is a function that is used for subset selection and the identification of interaction terms. It supports several model criteria, but we used the AIC and r-squared values in this project. Using the FSA function, we found three more interaction terms that were not yet identified:
After finding six different interaction terms, we found 7 more models. Models 11 to 16 were different models that contained interaction terms. We chose models that had different combinations of predictors. Some models only had variables about the company, while some others only had variables about the job position and the knowledge required.
After having all of the competing models ready, we used the k-fold cross-validation method to test our models. We decided to use this cross-validation method because we had a relatively small dataset of around 700 observations, and we wanted to get the most out of it. We used a k-value of 10, and here is the result of all of the 19 models:
The MAE and the RMSE were used to compare the models. Model 16 was the best as it had the lowest MAE and RMSE values, with Models 14 and 12 in second and third place. It was no surprise that Model 16 was the best since it had all the predictors and interaction terms. What is interesting is that the second and third best models did not have any terms about education or the number of skills required. Information about the company and job position is far more important when determining the salary of a job. However, this could be a result of not including the variables that indicate the specific skills that the jobs require. We decided not to include them in the model because we will explore the specific skills more in-depth in the second part of the project.
Looking at Model 16 and its significant coefficients, we could learn many factors that impact the salary of a job positively and negatively. Machine learning engineers get paid the most, followed by data scientists and data engineers. Senior positions are, as expected, higher-paid. Generally, companies with large revenue pay more, but if a job from this kind of company is paid hourly, its salary will greatly decrease. Interestingly, hourly-paid jobs from companies with low or medium revenue do not have the same kind of drastic difference. Companies with 51–200 and 5001–10,000 employees pay the most, and private companies also pay more than public companies, regardless of size. Positions from the government, non-profit and universities pay more than those from public and private companies. Jobs that give experience credit for Ph.D degree pays more. A good rule of thumb when trying to find a company that pays well is to always search big. Big companies, both employee and revenue wise, generally pays more than medium size companies, regardless of the sector of the company. This difference is especially huge in the finance, insurance, real estate, health care and IT field. One shoud also avoid super small companies with less than 50 employees. 51-200 employees is the ideal range if one dislikes big companies.
Lastly, we want to further learn about our model by looking at the actual vs. predicted plot:
From the plot, we see a moderate fit. The model does a fair job of predicting salaries in the mid-range, around 1,000. However, we do see that our data tends to overpredict when the true salary is low and underpredict when the true salary is high. Again, the possible explanation is probably that we did not use any information about what skills are required for each job. We did not include potentially 16 more variables in our models because it would be way too complicated. Plus, the specific skills are a topic on their own that we will explore next.
For our second question, we decided to analyze which skills were optimal for future employees to learn. We also decided to look at the number of skills employees should learn to best serve them in the future. First, we examined the average salary of learning no skills, and compared it to the average salary of jobs that required each major individual skill, which includes Python, Excel, and SAS.
| Python | meanSalary | |
|---|---|---|
| 1 | 0 | 763.048491528898 |
| 2 | 1 | 935.378336336025 |
| excel | meanSalary | |
|---|---|---|
| 1 | 0 | 890.031594669557 |
| 2 | 1 | 821.299214860342 |
| sas | meanSalary | |
|---|---|---|
| 1 | 0 | 841.243905576582 |
| 2 | 1 | 985.672722864641 |
Here, we can see that the average salary of a job with no required skills is much lower than jobs that require skills like Python or SAS. The salary listed at the value “0 Python” for example, represents the mean salary of jobs that do not require Python. “1 Python” represents the mean salary of jobs that do require Python. Surprisingly, the mean salary of jobs that require Excel is lower than the mean salary of jobs that do not require Excel–perhaps this is a signal that jobs that require Excel are less technical than other jobs, and therefore, have lower salaries.
Next, we wanted to look at which skills were most in demand. To do this, we used a bar chart to examine which skills had more job listings that required the certain skill than did not.
The bar on the right side represents the amount of jobs that required a certain skill, while the left side represented the number of jobs that did not require the skill. These charts tell us that the highest-demand skills are Python, Excel, and SQL, as these skills were sought out by a majority of job listings.
From here, we wanted to see how many skills an applicant should learn to maximize their salary.
We used this graph to plot the number of skills and mean salary. It plays out like one would expect it to–learning no skills has the lowest mean salary. However, learning two skills seemed to have the highest increase, and seemed the most reasonable. Furthermore, the confidence intervals do not overlap, proving that learning 2 skills is the most efficient way to boost one’s salary.
After looking at each combination of two-skill jobs, we found that learning Python and SAS together produces the highest expected salaries; however, we do not believe that this is the best pair of skills. We wanted to find a pair of skills that has the greatest salary, but also wanted to ensure that there was high availability for this job. In order to discover which jobs were high paying but also most available, we multiplied the annual salary by the amount of jobs that need this skill.
| Python | spark | aws | excel | sql | sas | keras | pytorch | scikit | tensor | hadoop | tableau | bi | flink | mongo | google_an | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Python | 23089 | 4665 | 3465 | 20852 | 15864 | 11499 | 0 | 0 | 0 | 1856 | 0 | 737 | 0 | 0 | 0 | 0 |
| spark | 4665 | 0 | 3684 | 0 | 734 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aws | 3465 | 3684 | 10739 | 11399 | 464 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| excel | 20852 | 0 | 11399 | 70059 | 9907 | 0 | 0 | 0 | 0 | 0 | 0 | 549 | 0 | 0 | 0 | 1269 |
| sql | 15864 | 734 | 464 | 9907 | 9683 | 2713 | 0 | 0 | 0 | 0 | 7702 | 2367 | 0 | 0 | 0 | 0 |
| sas | 11499 | 0 | 0 | 0 | 2713 | 2952 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| keras | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pytorch | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4575 | 0 | 0 | 0 | 0 | 0 | 0 |
| scikit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tensor | 1856 | 0 | 0 | 0 | 0 | 0 | 0 | 4575 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| hadoop | 0 | 0 | 0 | 0 | 7702 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tableau | 737 | 0 | 0 | 549 | 2367 | 0 | 0 | 0 | 0 | 0 | 0 | 2985 | 0 | 0 | 0 | 0 |
| bi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| flink | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mongo | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| google_an | 0 | 0 | 0 | 1269 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
This matrix depicts the total money invested in these jobs: a larger number means that they are highly sought after and highly paid. After this method, we discovered that the best two skills one should learn to easily obtain a high-salary position were Python and Excel.
Continuing our exploration we wanted to help out someone who already knows one language and is deciding which language they should learn next. We created a correlation matrix to show how many jobs require a specific additional skill, given you already know a specific skill.
This correlation matrix describes all job postings, not just those that require two skills For example, as seen in the matrix below the intersection between Spark and Python is 0.85. This number represents that out of all jobs that require Spark 85% of them also require Python. If we look at the intersection between Spark and SAS the correlation matrix shows 0.06. This means that out of all the skills that require Spark, only 6% also require SAS. This is very useful information as someone who already knows spark will have a better direction to which skill they should learn next. As the correlation between Spark and Python is very high, it represents that a job seeker that already knows Spark should focus their efforts on learning Python rather than SAS, as it would increase the amount of available jobs more.
The goal of the first question was to predict the salary of a job given its relevant information about the position, the company, and the required knowledge. Our best models were Model 16, 14, and 12. We learned that without knowing what specific skills are required for a job, the number of skills required does not impact the salary significantly. Information about the company and the specific job title were more important predictors. Interactions between the variables also provided a deeper relationship between the predictors and the salary. While the single terms generally hold, the interaction terms give us a closer look at the influences of the variables. This result is valuable and relevant today because it gives job seekers ideas and information about what to look for in a job and the company that is offering it. With this information, one knows whether the characteristics of a company could potentially have a positive or negative influence on the salary. This kind of information is crucial for job seekers, especially inexperienced students right out of college, as it saves them energy and time, which are two of the most valuable things for people looking for jobs. This information can also give employers insight into the current or future job position of the company. Using the model, employers can get a sense of the average salary of a particular job type at a company on the market. Employers can adjust the salary according to the “expected” salary at their company.
The goal of question two was to narrow down which skills were best to learn to maximize job salary. Here, we found that learning two skills together proved to have the highest salary outcomes, and among the two skill combinations, Python and SAS were the best to learn together to maximize salary. Although Python and Excel have a lower estimated salary, there is more total money invested in this position, proving it to be a highly available and well-paying job. However, if a person were to learn a single skill, they were best learning Python, Excel, or SQL. These results can serve as a guide for future applicants looking to maximize their salary by learning one skill.
To improve our model, we could gather more job postings to increase the accuracy of our predictions. Another way to improve our model is to broaden our investigation into more than two skills. Right now, our questions help job-seekers figure out which jobs are available, and which two skills they should learn to get paid the most. But now the question is: what happens after someone learns two skills? One way our investigation could be continued is by creating a model that would allow you to select the first two skills you already know. It could then provide a matrix for which skill you should learn next to maximize your salary, or to increase your job availability. This would be important for graduate students, or early employees looking to switch jobs.
Another problem with our model is that acquiring a job is not only about the “hard-skills”. The interviewing process is long and requires a lot of interpersonal and communication skills. Our model does not account for any of these soft skills, or past experiences such as leadership positions and community involvement. A model that depicts a more holistic depiction of what employers are looking for such as “strong leadership skills” or “self-starter” may prove useful to help students gain an upper hand in these interviews. One way we could do this is by making all of the required bullets different arrays and then analyzing these arrays to see which soft-skills most employers are looking for.